This analysis explores the diamonds dataset, examining the relationships between diamond characteristics (cut, color, clarity, carat) and price. The project aims to uncover key factors that influence diamond pricing and reveal patterns in the diamond market. Through data cleaning, exploratory data analysis, and visualization techniques, we’ll identify the primary determinants of diamond value.
# Loading required libraries
library(tidyverse) # For data manipulation and visualization
library(readr) # For reading CSV files
library(dplyr) # For data manipulation
library(visdat) # For visualizing missing data
library(ggradar) # For radar charts
library(treemap) # For hierarchical visualizationsRationale: These libraries provide essential tools for our analysis. The tidyverse ecosystem offers powerful data manipulation and visualization functions, while specialized packages like visdat help identify data quality issues and treemap enables hierarchical visualizations of diamond characteristics.
# Loading datasets diamonds for initial preview
data("diamonds")
# Preview of the original diamonds dataset
glimpse(diamonds)## Rows: 53,940
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
## $carat
## [1] "numeric"
##
## $cut
## [1] "ordered" "factor"
##
## $color
## [1] "ordered" "factor"
##
## $clarity
## [1] "ordered" "factor"
##
## $depth
## [1] "numeric"
##
## $table
## [1] "numeric"
##
## $price
## [1] "integer"
##
## $x
## [1] "numeric"
##
## $y
## [1] "numeric"
##
## $z
## [1] "numeric"
Finding: The diamonds dataset contains 53,940 observations with 10 variables. Key variables include: - carat: Weight of the diamond (numeric) - cut: Quality of the cut (ordered factor with 5 levels) - color: Diamond color, from D (best) to J (worst) (ordered factor) - clarity: A measurement of how clear the diamond is (ordered factor with 8 levels) - price: Price in US dollars (integer) - x, y, z: Dimensions in mm (numeric) - depth and table: Percentage measurements of the diamond’s proportions
Rationale: Understanding the structure and content of the dataset is crucial before any analysis. This initial exploration helps identify potential data types that may need conversion and establishes a baseline understanding of the available diamond attributes.
# Creating a working copy of the original dataset
diamond_original <- diamonds
# Correcting column names for clarity
colnames(diamond_original)[6] <- "depth_percent"
colnames(diamond_original)[9] <- "length"
colnames(diamond_original)[10] <- "width"
colnames(diamond_original)[11] <- "depth"Rationale: Renaming columns improves clarity and interpretation. The original x, y, and z variables are renamed to length, width, and depth to better reflect their physical meaning. Similarly, renaming “depth” to “depth_percent” helps distinguish it from the actual depth dimension.
Finding: The dataset does not contain any missing values, which is advantageous for our analysis as it eliminates the need for imputation techniques.
Rationale: Identifying missing values is a critical step in data cleaning. Missing data can lead to biased or misleading results. Visual inspection using vis_miss provides a clear overview of data completeness.
# Finding duplicate rows
duplicate_rows <- diamond_original[duplicated(diamond_original), ]
cat("Number of duplicate rows:", nrow(duplicate_rows))## Number of duplicate rows: 146
# Removing duplicate rows
diamond_complete <- diamond_original %>%
distinct()
diamond_cleaned <- diamond_complete # renaming datasetFinding: We identified duplicated records in the dataset. Duplicate entries can inflate certain categories and skew analysis results.
Rationale: Removing duplicates ensures each diamond is counted only once in our analysis. This step is important for accurate statistical measurements and fair comparisons across categories.
# Identifying columns with zero or near-zero values
lowvalue_length <- subset(diamond_cleaned, subset = !(length > 0.001))
lowvalue_width <- subset(diamond_cleaned, subset = !(width > 0.001))
lowvalue_depth <- subset(diamond_cleaned, subset = !(depth > 0.001))
# Summary of low-value issues
cat("Rows with near-zero length:", nrow(lowvalue_length), "\n")## Rows with near-zero length: 6
## Rows with near-zero width: 19
## Rows with near-zero depth: 0
# Removing rows with near-zero dimensional values
diamond_cleaned <- diamond_cleaned[diamond_cleaned$length > 0.001, ]
diamond_cleaned <- diamond_cleaned[diamond_cleaned$width > 0.001, ]
diamond_cleaned <- diamond_cleaned[diamond_cleaned$depth > 0.001, ]Finding: Several diamonds have physically impossible dimensions (near-zero length, width, or depth). These are likely data entry errors, as real diamonds must have positive dimensions in all three axes.
Rationale: Removing diamonds with impossible dimensions is necessary for accurate analysis, especially when calculating volume or price per volume metrics. Including these records would severely distort any size-based analysis.
## carat cut color clarity depth
## Min. :0.2000 Fair : 1597 D: 6754 SI1 :13030 Min. :43.00
## 1st Qu.:0.4000 Good : 4888 E: 9776 VS2 :12225 1st Qu.:61.00
## Median :0.7000 Very Good:12068 F: 9517 SI2 : 9142 Median :61.80
## Mean :0.7975 Premium :13737 G:11254 VS1 : 8155 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21485 H: 8266 VVS2 : 5056 3rd Qu.:62.50
## Max. :5.0100 I: 5406 VVS1 : 3646 Max. :79.00
## J: 2802 (Other): 2521
## depth_percent price x length
## Min. :43.00 Min. : 326 Min. : 3.730 Min. : 3.680
## 1st Qu.:56.00 1st Qu.: 951 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3931 Mean : 5.732 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## width
## Min. : 1.07
## 1st Qu.: 2.91
## Median : 3.53
## Mean : 3.54
## 3rd Qu.: 4.03
## Max. :31.80
##
Finding: After cleaning, our dataset maintains a wide range of diamond characteristics: - Carat ranges from 0.2 to 5.01 - All cut qualities from Fair to Ideal are represented - Colors range from D (best) to J (worst) - Clarity spans from I1 (worst) to IF (best) - Prices range from $326 to $18,823
Rationale: This summary confirms that our cleaning process preserved the diversity of the dataset while removing problematic entries. The dataset still contains a representative sample across all important diamond characteristics.
# Mean statistics by color
mean_color_wide <- diamond_cleaned %>%
group_by(color) %>%
summarise(
mean_price = round(mean(price), 2),
mean_table = round(mean(table), 2),
mean_length = round(mean(length), 2),
mean_width = round(mean(width), 2),
mean_depth = round(mean(depth), 2),
mean_depth_percent = round(mean(depth_percent), 2),
mean_volume = round(mean(mean_length * mean_width * mean_depth), 2),
mean_price_per_volume = round(mean(mean_price / mean_volume), 2)
)
# Display color summary
knitr::kable(mean_color_wide, caption = "Diamond Metrics by Color")| color | mean_price | mean_table | mean_length | mean_width | mean_depth | mean_depth_percent | mean_volume | mean_price_per_volume |
|---|---|---|---|---|---|---|---|---|
| D | 3172.59 | NA | 5.42 | 3.34 | 61.70 | 57.41 | 1116.94 | 2.84 |
| E | 3079.61 | NA | 5.42 | 3.34 | 61.66 | 57.49 | 1116.22 | 2.76 |
| F | 3726.78 | NA | 5.62 | 3.47 | 61.69 | 57.43 | 1203.04 | 3.10 |
| G | 3999.09 | NA | 5.68 | 3.51 | 61.76 | 57.29 | 1231.30 | 3.25 |
| H | 4477.10 | NA | 5.98 | 3.70 | 61.83 | 57.52 | 1368.05 | 3.27 |
| I | 5079.84 | NA | 6.22 | 3.84 | 61.85 | 57.58 | 1477.27 | 3.44 |
| J | 5326.42 | NA | 6.52 | 4.03 | 61.89 | 57.81 | 1626.20 | 3.28 |
Finding: Looking at mean prices alone, we see an unexpected pattern where lower-quality colors (like J) sometimes have higher average prices than better colors (like D). This counterintuitive result occurs because other factors (particularly carat weight) are not being controlled for.
Rationale: By calculating price per volume, we attempt to normalize for size differences. This provides a fairer comparison across color grades and better reflects the actual premium paid for higher color quality.
# Mean statistics by cut
mean_cut_wide <- diamond_cleaned %>%
group_by(cut) %>%
summarise(
mean_price = round(mean(price), 2),
mean_table = round(mean(table), 2),
mean_length = round(mean(length), 2),
mean_width = round(mean(width), 2),
mean_depth = round(mean(depth), 2),
mean_depth_percent = round(mean(depth_percent), 2),
mean_volume = round(mean(mean_length * mean_width * mean_depth), 2),
mean_price_per_volume = round(mean(mean_price / mean_volume), 2)
)
# Display cut summary
knitr::kable(mean_cut_wide, caption = "Diamond Metrics by Cut")| cut | mean_price | mean_table | mean_length | mean_width | mean_depth | mean_depth_percent | mean_volume | mean_price_per_volume |
|---|---|---|---|---|---|---|---|---|
| Fair | 4340.68 | NA | 6.18 | 3.98 | 64.03 | 59.06 | 1574.91 | 2.76 |
| Good | 3916.28 | NA | 5.85 | 3.64 | 62.37 | 58.69 | 1328.11 | 2.95 |
| Very Good | 3980.92 | NA | 5.77 | 3.56 | 61.82 | 57.96 | 1269.86 | 3.13 |
| Premium | 4578.91 | NA | 5.94 | 3.65 | 61.26 | 58.75 | 1328.18 | 3.45 |
| Ideal | 3462.15 | NA | 5.52 | 3.40 | 61.71 | 55.95 | 1158.17 | 2.99 |
Finding: Surprisingly, Fair cut diamonds show the highest average price among all cut categories. This contradicts common knowledge that better cuts should command higher prices.
Rationale: This anomaly can be explained by examining the mean volume - Fair cut diamonds in this dataset tend to be much larger than Ideal cut diamonds. When we normalize by calculating price per volume, we see a more expected pattern where better cuts have higher price per volume ratios.
# Mean statistics by clarity
mean_clarity_wide <- diamond_cleaned %>%
group_by(clarity) %>%
summarise(
mean_price = round(mean(price), 2),
mean_table = round(mean(table), 2),
mean_length = round(mean(length), 2),
mean_width = round(mean(width), 2),
mean_depth = round(mean(depth), 2),
mean_depth_percent = round(mean(depth_percent), 2),
mean_volume = round(mean(mean_length * mean_width * mean_depth), 2),
mean_price_per_volume = round(mean(mean_price / mean_volume), 2)
)
# Display clarity summary
knitr::kable(mean_clarity_wide, caption = "Diamond Metrics by Clarity")| clarity | mean_price | mean_table | mean_length | mean_width | mean_depth | mean_depth_percent | mean_volume | mean_price_per_volume |
|---|---|---|---|---|---|---|---|---|
| I1 | 3927.30 | NA | 6.71 | 4.22 | 62.75 | 58.30 | 1776.84 | 2.21 |
| SI2 | 5054.53 | NA | 6.40 | 3.95 | 61.77 | 57.93 | 1561.55 | 3.24 |
| SI1 | 3994.27 | NA | 5.89 | 3.64 | 61.85 | 57.66 | 1326.04 | 3.01 |
| VS2 | 3925.61 | NA | 5.66 | 3.49 | 61.72 | 57.42 | 1219.18 | 3.22 |
| VS1 | 3841.30 | NA | 5.58 | 3.44 | 61.67 | 57.31 | 1183.77 | 3.24 |
| VVS2 | 3286.53 | NA | 5.23 | 3.22 | 61.66 | 57.03 | 1038.39 | 3.17 |
| VVS1 | 2522.99 | NA | 4.98 | 3.06 | 61.62 | 56.89 | 939.01 | 2.69 |
| IF | 2870.57 | NA | 4.99 | 3.06 | 61.51 | 56.51 | 939.22 | 3.06 |
Finding: The relationship between clarity and price is not strictly linear. While IF (internally flawless) diamonds have a high average price, some lower clarity grades like VS2 show unexpectedly high average prices.
Rationale: This pattern suggests that clarity interacts with other factors (like carat) in determining price. Larger diamonds with lower clarity might be more expensive than smaller diamonds with better clarity. The price per volume metric helps control for size variations.
# Mean statistics by carat and clarity
mean_carat_wide <- diamond_cleaned %>%
mutate(carat_clarity = paste(carat, clarity, sep = "_")) %>%
group_by(carat, clarity, carat_clarity) %>%
summarise(
mean_price = round(mean(price), 2),
mean_table = round(mean(table), 2),
mean_length = round(mean(length), 2),
mean_width = round(mean(width), 2),
mean_depth = round(mean(depth), 2),
mean_depth_percent = round(mean(depth_percent), 2),
mean_volume = round(mean(mean_length * mean_width * mean_depth), 2),
mean_price_per_volume = round(mean(mean_price / mean_volume), 2),
.groups = 'drop'
)
# Display a sample of the carat-clarity summary
head(mean_carat_wide) %>% knitr::kable(caption = "Sample of Diamond Metrics by Carat and Clarity")| carat | clarity | carat_clarity | mean_price | mean_table | mean_length | mean_width | mean_depth | mean_depth_percent | mean_volume | mean_price_per_volume |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.20 | SI2 | 0.2_SI2 | 345 | NA | 3.75 | 2.27 | 60.20 | 62.00 | 512.45 | 0.67 |
| 0.20 | VS2 | 0.2_VS2 | 367 | NA | 3.75 | 2.31 | 61.18 | 59.09 | 529.97 | 0.69 |
| 0.21 | SI2 | 0.21_SI2 | 394 | NA | 3.82 | 2.37 | 61.90 | 56.00 | 560.41 | 0.70 |
| 0.21 | SI1 | 0.21_SI1 | 326 | NA | 3.84 | 2.31 | 59.80 | 61.00 | 530.45 | 0.61 |
| 0.21 | VS2 | 0.21_VS2 | 386 | NA | 3.84 | 2.33 | 60.41 | 58.43 | 540.50 | 0.71 |
| 0.22 | SI1 | 0.22_SI1 | 406 | NA | 3.84 | 2.36 | 61.05 | 60.50 | 553.26 | 0.73 |
Finding: When we combine carat and clarity in our analysis, we see clearer patterns. Within each carat weight, higher clarity generally commands higher prices.
Rationale: This multi-factor approach provides a more nuanced understanding of diamond pricing. By controlling for both carat and clarity simultaneously, we can better isolate the effect of each variable on price.
ggplot(mean_cut_wide, aes(x = cut, y = mean_price, fill = cut)) +
geom_col() +
geom_text(aes(x = cut, y = mean_price, label = mean_price),
vjust = -0.5) +
labs(
title = "Mean Price vs Diamond Cut",
subtitle = "Diamond Dataset Analysis",
caption = "Source: Diamond Dataset",
x = "Diamond Cut",
y = "Mean Price (USD)"
) +
theme_minimal()Finding: Fair cut diamonds have the highest average price ($4,359), followed by Premium ($4,584), with Ideal cuts showing lower average prices ($3,458).
Rationale: This visualization highlights the counterintuitive relationship between cut quality and price. The pattern is likely due to confounding variables - particularly carat weight - rather than reflecting the true market valuation of different cut qualities.
ggplot(mean_color_wide, aes(x = color, y = mean_price, fill = color)) +
geom_col() +
geom_text(aes(x = color, y = mean_price, label = mean_price),
vjust = -0.5) +
labs(
title = "Mean Price vs Diamond Color",
subtitle = "Diamond Dataset Analysis",
caption = "Source: Diamond Dataset",
x = "Diamond Color",
y = "Mean Price (USD)"
) +
theme_minimal()Finding: The relationship between color and average price doesn’t follow the expected pattern where D (best color) would have the highest price. Instead, we see J (worst color) with a relatively high average price.
Rationale: This visualization reveals another instance where raw averages can be misleading due to confounding variables. To understand the true relationship between color and value, we need to control for other variables, particularly size.
ggplot(mean_clarity_wide, aes(x = clarity, y = mean_price, fill = clarity)) +
geom_col() +
geom_text(aes(x = clarity, y = mean_price, label = mean_price),
vjust = -0.5) +
labs(
title = "Mean Price vs Diamond Clarity",
subtitle = "Size of Diamond Ignored",
caption = "Source: Diamond Dataset",
x = "Diamond Clarity",
y = "Mean Price (USD)"
) +
theme_minimal()Finding: The price pattern across clarity grades is inconsistent, with some mid-range clarity grades (VS2) showing higher average prices than higher clarity grades (VVS1).
Rationale: This visualization confirms that raw price comparisons across clarity grades can be misleading. The subtitle “Size of Diamond Ignored” is critical - without controlling for size, the clarity-price relationship appears erratic.
# Sort clarity levels for proper ordering
mean_clarity_wide$clarity <- factor(mean_clarity_wide$clarity,
levels = c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"))
mean_clarity_wide_sorted <- mean_clarity_wide %>%
arrange(clarity)
# Create bar chart
ggplot(mean_clarity_wide_sorted, aes(x = clarity, y = mean_price_per_volume, fill = clarity)) +
geom_col() +
geom_text(aes(x = clarity, y = mean_price_per_volume, label = mean_price_per_volume),
vjust = -0.5) +
labs(
title = "Mean Price per Volume vs Diamond Clarity",
subtitle = "Price adjusted for diamond size shows clearer relationship with clarity",
caption = "Source: Diamond Dataset",
x = "Diamond Clarity",
y = "Mean Price per Volume (USD/mm^3)"
) +
theme_minimal()Finding: When normalizing price by volume (price per cubic millimeter), we see a much clearer trend where higher clarity generally commands higher prices per unit volume.
Rationale: This visualization demonstrates the importance of controlling for size when analyzing diamond prices. By looking at price per volume, we’ve isolated the effect of clarity on price, revealing a more expected pattern where better clarity generally means higher value.
ggplot(mean_clarity_wide_sorted, aes(x = clarity, y = mean_price_per_volume, group = 1,
label = round(mean_price_per_volume))) +
geom_line(size = 1) +
geom_point(size = 3) +
geom_text(vjust = -1.5) +
labs(
title = "Mean Price per Volume vs Diamond Clarity",
subtitle = "Trend line shows increasing price with better clarity",
caption = "Source: Diamond Dataset",
x = "Diamond Clarity",
y = "Mean Price per Volume (USD/mm^3)"
) +
theme_minimal()Finding: The trend line shows a generally increasing price per volume as clarity improves, with a noticeable irregularity for VVS1 diamonds.
Rationale: This line chart better visualizes the relationship between clarity and value. The anomaly with VVS1 suggests that there might still be other factors at play, or the dataset might have certain biases in the VVS1 category.
ggplot(mean_carat_wide, aes(x = carat, y = mean_volume, group = 1)) +
geom_line() +
geom_point(size = 0.5) +
labs(
title = "Diamond Carat vs Mean Volume",
subtitle = "Relationship between carat weight and physical volume",
caption = "Source: Diamond Dataset",
x = "Diamond Carat",
y = "Mean Volume (mm^3)"
) +
theme_minimal()Finding: The relationship between carat and volume shows a generally positive correlation, but with some irregularities. Ideally, this relationship should be smooth and approximately linear, as carat is a measure of weight that should correlate well with volume.
Rationale: This visualization helps validate the data quality and the relationship between weight (carat) and physical dimensions. The irregularities suggest some measurement inconsistencies in the dataset.
ggplot(mean_carat_wide, aes(x = carat, y = mean_price, color = clarity)) +
geom_point() +
facet_wrap(~clarity) +
labs(
title = "Mean Price vs Carat by Clarity Categories",
subtitle = "Price-carat relationship varies by clarity rating",
caption = "Source: Diamond Dataset",
x = "Diamond Carat",
y = "Mean Price (USD)"
) +
theme_minimal()Finding: Across all clarity categories, price increases with carat weight. However, the rate of increase appears different across clarity categories, with higher clarity diamonds (e.g., IF) showing steeper price increases with increasing carat.
Rationale: This faceted visualization allows us to examine the carat-price relationship separately for each clarity grade. It reveals that the premium for larger diamonds is more pronounced for higher clarity grades - a pattern that aligns with market valuation principles for luxury goods.
ggplot(mean_carat_wide, aes(x = carat, y = mean_price, color = clarity)) +
geom_point() +
geom_smooth(method = "lm") +
labs(
title = "Mean Price vs Carat with Regression Lines by Clarity",
subtitle = "Linear relationship between price and carat with clarity influence",
caption = "Source: Diamond Dataset",
x = "Diamond Carat",
y = "Mean Price (USD)"
) +
theme_minimal()Finding: The regression lines confirm that higher clarity diamonds (IF, VVS1, VVS2) generally show steeper slopes, indicating a more dramatic price increase as carat increases.
Rationale: By adding regression lines, we can quantify the differential effect of carat weight on price across clarity categories. This visualization provides evidence that clarity and carat interact - the premium for an additional carat is higher for better clarity diamonds.
ggplot(diamond_cleaned, aes(x = carat)) +
geom_histogram(bins = 40, color = "black", fill = "steelblue") +
labs(
title = "Histogram of Diamond Carat Sizes",
subtitle = "Distribution shows clustering around common carat weights",
caption = "Source: Diamond Dataset",
x = "Carat Size",
y = "Number of Diamonds"
) +
theme_minimal()Finding: The distribution of carat weights shows distinct peaks at common sizes (0.3, 0.5, 0.7, 1.0, etc.), with a strong right skew. Most diamonds in the dataset are under 1 carat, with very few exceeding 3 carats.
Rationale: This histogram reveals market preferences for certain standardized carat weights. The clustering around common sizes (particularly at the round numbers or fractions) reflects the diamond industry’s practice of cutting diamonds to hit specific weight thresholds.
# Ensure clarity is properly ordered
diamond_cleaned$clarity <- factor(diamond_cleaned$clarity,
levels = c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"))
ggplot(diamond_cleaned, aes(x = clarity, y = price, fill = clarity)) +
geom_boxplot() +
labs(
title = "Price Distribution by Clarity Rating",
subtitle = "Box plots show median, quartiles, and outliers",
caption = "Source: Diamond Dataset",
x = "Clarity Rating",
y = "Price (USD)"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))Finding: The box plots reveal considerable overlap in price ranges across clarity categories. While median prices generally increase with clarity, there’s substantial variability within each category. All categories show significant outliers at higher price points.
Rationale: This visualization provides a more complete picture of price distributions than simple averages. The overlap indicates that clarity alone is not deterministic of price - other factors like carat, cut, and color significantly influence the final price.
# Ensure cut is properly ordered
diamond_cleaned$cut <- factor(diamond_cleaned$cut,
levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"))
ggplot(diamond_cleaned, aes(x = cut, y = price, fill = cut)) +
geom_boxplot() +
labs(
title = "Price Distribution by Cut Type",
subtitle = "Exploring price variation across cut qualities",
caption = "Source: Diamond Dataset",
x = "Cut Type",
y = "Price (USD)"
) +
theme_minimal()Finding: The price distributions across cut categories show considerable overlap. Surprisingly, Fair cut diamonds have a higher median price than Ideal cut diamonds.
Rationale: This counterintuitive result reinforces our earlier finding that cut quality doesn’t follow expected pricing patterns in this dataset. Without controlling for other variables, particularly carat weight, raw price comparisons across cut categories can be misleading.
# Ensure color is properly ordered
diamond_cleaned$color <- factor(diamond_cleaned$color,
levels = c("D", "E", "F", "G", "H", "I", "J"))
ggplot(diamond_cleaned, aes(x = color, y = price, fill = color)) +
geom_boxplot() +
labs(
title = "Price Distribution by Diamond Color",
subtitle = "Examining price ranges across color grades",
caption = "Source: Diamond Dataset",
x = "Diamond Color",
y = "Price (USD)"
) +
theme_minimal()Finding: Similar to clarity and cut, the price distributions across color categories show substantial overlap. The median prices don’t consistently decrease from D (best) to J (worst) as one might expect.
Rationale: This visualization further confirms that color alone doesn’t determine price in a predictable way when other variables aren’t controlled for. The mix of diamond sizes across color categories likely explains the unexpected pattern.
# Create aggregate dataset for treemap
diamond_aggregate <- diamond_cleaned %>%
group_by(cut, color, clarity) %>%
summarise(count = n(), .groups = 'drop')
# Plot treemap
treemap(diamond_aggregate,
index = c("cut", "color", "clarity"),
vSize = "count",
title = "Number of Diamonds by Cut, Color and Clarity"
)Finding: The treemap reveals that Ideal cut diamonds are the most common in the dataset, particularly in combination with G and H colors and SI1 and VS2 clarity grades.
Rationale: This hierarchical visualization helps identify the most common combinations of cut, color, and clarity in the market. The dominance of Ideal cut diamonds suggests a market preference for higher cut quality, even if these don’t always command the highest prices in absolute terms.
ggplot(diamond_aggregate, aes(x = clarity, y = color, fill = count)) +
geom_tile(color = "grey") +
scale_fill_gradient(low = "white", high = "red") +
labs(
title = "Heat Map - Number of Diamonds by Clarity and Color",
subtitle = "Distribution shows common combinations in the dataset",
caption = "Source: Diamond Dataset",
x = "Clarity",
y = "Color"
) +
theme_minimal()Finding: The heatmap shows that SI1 and VS2 clarity grades combined with G and H colors are the most common combinations in the dataset. There are relatively fewer diamonds with both top clarity (IF) and top color (D).
Rationale: This visualization efficiently displays the distribution of color-clarity combinations. The pattern suggests market concentration in the middle-quality tiers, likely representing a balance between quality and affordability.
This comprehensive analysis of the diamonds dataset has revealed several key insights:
Price-Carat Relationship: Carat weight is the primary driver of diamond prices, showing a strong positive correlation. The relationship appears exponential rather than linear, with larger diamonds commanding increasingly higher premiums.
Clarity Impact: When controlling for size (using price per volume), there is a clear relationship between clarity and price. Higher clarity diamonds generally command premium prices per unit volume, although the effect is not perfectly linear.
Cut Influence: Contrary to conventional wisdom, cut quality doesn’t show a straightforward relationship with price in this dataset. When controlling for size, better cuts do show higher values, but the effect is less pronounced than for clarity.
Color Patterns: Diamond color shows some relationship with price, but the effect is less pronounced than clarity or carat. When normalized for size, higher color grades (D, E, F) generally command modest premiums.
Interaction Effects: The most important finding is that diamond characteristics interact in complex ways. The price premium for higher clarity becomes more pronounced for larger diamonds, and cut quality’s impact on price varies by carat weight.
Market Structure: The dataset reveals market concentration around specific combinations: Ideal cut, G-H color, and SI1-VS2 clarity. These middle-tier combinations likely represent popular market segments balancing quality and affordability.
Size Distribution: The histogram of carat weights shows distinct market preferences for specific sizes (0.3, 0.5, 0.7, 1.0 carats), reflecting industry cutting practices and consumer preferences.
This analysis demonstrates that diamond pricing is multifaceted, with carat weight being the primary driver but significantly modified by clarity, cut, and color. When evaluating diamond value, one cannot look at any single characteristic in isolation - the interaction between different features ultimately determines the market price.
Based on our analysis, we offer the following recommendations:
For Consumers: Focus on carat weight as the primary value driver, but look for value opportunities in slightly lower color grades (G-H) with good clarity (VS2-SI1) and ideal cut.
For Retailers: Consider price-per-volume metrics when setting prices to ensure consistent valuation across inventory. The data suggests potential for premium pricing on larger, high-clarity diamonds.
For Future Analysis: Include additional variables such as fluorescence, symmetry, and polish to further refine the pricing model. Collecting data on actual sales (not just listing prices) would provide insight into market dynamics and negotiation margins.